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Abstract 



The amino acid sequences of proteins provide rich information for inferring distant phylogenetic 
relationships and for predicting protein functions. Estimating the rate matrix of residue substitutions from 
amino acid sequences is also important because the rate matrix can be used to develop scoring matrices 
for sequence alignment. Here we use a continuous time Markov process to model the substitution rates 
of residues and develop a Bayesian Markov chain Monte Carlo method for rate estimation. We validate 
our method using simulated artificial protein sequences. Because different local regions such as binding 
surfaces and the protein interior core experience different selection pressures due to functional or stability 
constraints, we use our method to estimate the substitution rates of local regions. Our results show 
that the substitution rates are very different for residues in the buried core and residues on the solvent 
exposed surfaces. In addition, the rest of the proteins on the binding surfaces also have very different 
substitution rates from residues. Based on these findings, we further develop a method for protein 
function prediction by surface matching using scoring matrices derived from estimated substitution rates 
for residues located on the binding surfaces. We show with examples that our method is effective in 
identifying functionally related proteins that have overall low sequence identity, a task known to be very 
challenging. 



Introduction 



Amino acid sequences are an important source of information for inferring distant phylogenetic 
relationships and for predicting the biochemical functions of protein. Because the substitutions of 
nucleotides can become rapidly saturated, and the likelihood of unrelated identical substitutions is 
high for nucleotides, the information of evolutionary conservation of nucleotides is quickly obscured 
after a number of generations. The mapping of DNA sequences by the genetic code to amino 
acid sequences freq uentlv can reveal more r emote evolutionary relation with more interpretable 
sequence similarity ( Lio and GoldmanL 1999h . In addition, statistical analysis of protein sequence 
alignment is also more reliable, as it is much more difficult to detect and correct for deviations 
from independent identical distributions in DNA sequences due to possible translation of normal 
complexity DNA sequences into low comp lexity protein sequences such as tandem repeats of simple 



patterns of a few residues ( Pearsonl . 



The success in detecting evolutionarily related protein sequences through sequence alignment 
depends on the use of a scoring matrix, which determines the similarity between residues. Rate 
matrices of amino acid residue substitutions ca n be the basis for the developing of ma ny scoring ma- 
trices for sequence alignment. Dayhoff et al. (jDavhoff. Schwartz, and Orcutt . were the first 
to develop empirical models of amino acid residue substitutions. They used a counting method 
to obtain accepted point mutation matrices (called Pam matrices). The widely used Blosum 
matrices can be viewed as analogous to transition ma trices of residues at different time intervals 
( Henikoff and Henikofj . 19921 : Lio and Goldmanl . Il998l l . They were developed following a heuris- 
tic counting approach similar to that of P am, and were derived from structure-based alignments 
of blocks of sequences of related proteins ( Henikoff and Henikofj . 1992h . Both Pam and Blosum 
matrices are widel y used for sequence alignment (e.Q., in software tools such as Fasta, B last 
and Clustal W) dAltschul et al.lll99nl : |Pearsonlll990l : Irhompson. Higgins. and Gibsonlll994h . An 
update of the Pam matrices based on the same counting approach using a much enlarged database 
is the Jones- Taylo r-Thornton (Jtt) amino acid substitution matrix, which is widely used for ph y- 
logenetic analysis ( Jones. Tavlor. and Thorntxml 19921 : Adachi and Hasegawal . W9(i : Yangl . 19971 ). 
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Whelan and Goldman pointed out that these counting methods are effectively equivalent to 
the maximum parsimony method, and therefore suffer from several drawbacks: the systematic un- 
derestimation of substitutions in certain branches of a ph ylogenv and the inefficiency in using all 
information contained in the amino acid residue sequences () Whelan and Go ldman. 2001) . This can 
be a serious problem for applications such as inferring protein functions from a protein sequence, as 
the number of sequence homologs available for multiple sequence alignment is often limited. In ad- 
dition, matrices such as Pam and Blosum have implicit parameters whose values were determined 
from the precomputed analysis of large quantities of sequences, while the information of the particu- 
lar protein of interest has limited or no influence. A more effective approach for studying amino acid 
residue substitution s is to emplov an explicit continuous time Markov model based on a phylogenetic 
tree of the protein ( Yang. Nielsen, and Haseeawal . Il998l : IWhelan and GoldmanL l200lh . Markovian 
evolutionary models are parametric models and do not have pr e-specified parameter values. Thes e 
values are estimated from data specific to the protein of interest ( Whelan. Lio. and Goldmanl . 2001 ). 
Recent work using this approach has shown that more informative rate matrices can be derived . 
with significant advantages over matrices obtained from counting method ( Whelan and Goldmanl . 
l200lh . 

Despite these important results, current studies of the substitution rates of amino acid residues 
are based on the assumption that the whole protein sequence experience similar selection pressure 
and therefore have the same substitution rates. There is no distinction for different regions of 
proteins, namely, all sites have the same evolutionary rates. This is an unrealistic assumption. For 
example, regions that directly participate in biochemical functions, such as binding surfaces, are 
likely to experience very different selection pressure from other regions. In the protein interior, 
hydrophobic amino acid residues may be conserved not due to their functional roles, but due 
to the constraints of m ainta i ning protein stability, as hydroph o bic interactions are the driving 
force of prote i n fold ing (jPilll . \l99(\ : Govindaraian and Goldsteinl . 1997 ; Parisi and Echavj . 2001al : 



Li and LianI bOQ^ ). Similarly, residues in the transmembrane segmen ts of membrane proteins 



experience different se lection pressure from soluble parts of the proteins ()Li6 and Goldmanl . IT999I : 
Tourasse and Lil2000l ). It is therefore important to study region-specific residue replacement rates. 

An important advance in the recons truction of phylogenv is the considera tion of heterogenous 
substitution rates among different sites ( Yang et al. l200nl : Mavrose et al. 2004h . However, these are 
based on substitution models of either nucleotides or codons, with sometimes discretized categories 
of rates. Because of the large number of parameters due to an alphabet size of 20 for amino acid 
residues, it is impractical to estimate site-specific rates for amino acid residue sequences. 

In this study, we use a continuous time Markov model to estimate residue substitution rates 
for spatially defined regions of proteins based on known three-dimensional structures of proteins 

( Liang. Edelsbrunner. and Woodwardl . 1998bl : Binkowski. Adamian. and Liang | j_ 2003a ) . Different 

from previous studies of rate estimation based on maximum likelihood rnethods dPelsenstein and Churchil] . 

ll99fil : lYang. Nielsen and Hasegawal . ll99Sl : Iwhelan and Goldmanl . I2OO1I : ISienel and Haniss1eT] . l20o'4i r 

we develop a Bayesian method to estimate the posterior mean values of the instantaneous rates 

of residue substitution. Our approach is based on the te chnique of Markov chain Monte Carlo, a 



method that has been widely used in phvlogenet ic analysis (|Yang and R,annalal . ll997l : lMau. Newton, and Larget . 



Ti^: lHue]senbeck. Rannala. and Lairge^ B. To derive well defined spatia~ions of proteins 



which are formed by resi dues well separated in primary seq uences, we rely on computational analy- 
sis of protein structures ((Liang. Edelsbrunner. and Woodward. .1998b ) . In our study, these distant 
residues in sequences are spatial neighbors that participate in direct molecular binding events, and 
can be regarded as belonging to the same class of substitution rates. Our study is also moti- 
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vated by the need to deduce related functions from protein structures, i.e., to identify functionally 
related protein structures. As structural biology proceeds, there is an increasing number of pro- 
teins whose atomic str uctures are resolved, yet their biological functions are completely unknown 
( Sanishvih et al.l . 1200.'^ ) . 



Our results show that residue substitution rates are significantly different for different regions 
of the proteins, e.g., for the buried protein core, solvent exposed surfaces, and specific binding 
surfaces on protein structures. We also develop a novel method for inferring protein functions. 
Using residue similarity scoring matrices derived from estimated substitution rates for protein 
surfaces, our method is far more effective than several other methods in detecting similar binding 
surface that are functionally related from different protein structures. This is a challenging task, 
as it is well known that function predictio n becomes difficult when the sequence identity between 
two proteins is below 60-70% (Rost . 20oi lTia,ri a.rid Sk olnick. 20^). 

This paper is organized as follows. We first describe the continuous time Markov model for 
residue substitution rates. We then discuss how to compute the likelihood function of substitution 
rate matrices given a specific phylogeny and a multiple sequence alignment. The Markov chain 
Monte Carlo method is then briefly described, including the design of move sets that helps to 
improve the rate of mixing. We then describe simulation results in estimating substitution rates. 
This is followed by discussion of the results of different substitution rates estimated for different 
regions of a set of proteins. We then give examples to show how residue scoring matrices derived 
from the estimated rate matrix can improve detection of functionally related proteins. 



Model and Methods 



Continuous time Markov process for residue substitution. 

For a given ph ylogenetic t r ee, we use a revers ible continuous time Markov process as our evolu- 
tionary model ( Felsensteinl . 1981 : Yangl . 1994al ). This model has several advantages over empirical 
methods. For example, Markovian evolutionary models are parametric models and do not have 
pre-spec ified parameter values. Th ese values are all estimated from data specific to the protein of 
interest ( Whelan. Lio. and Goldm an. 200^). In addition, previous works showed that the effects of 
secondary structure and solvent accessibility are important for protein evolution, and such effects 
can be captured by a Markovi an evolutionarv model, while it is difficult for empirical methods to 
take thes e effect s into account ( Goldman. Thorne. and Joneslll996bl . 1998bl : Lio and Goldmanlll999l : 
iRnbinso n et a].l En03l . 

Once the tree topology and the time intervals of sequence divergence {t} (or the branch lengths) 
of the phylogenetic tree are known, the parameters of the model are the 20 x 20 rate matrix Q 
for the 20 amino acid residues. Because substitution rate and divergence time t are co nfounded, t 
cannot be expressed in absolute units. We follow the approach of ( Adac hi and Hasegaw a. 1996.) to 
represent the divergence time t as the expected number of residue changes per 100 sites between the 
sequences. The entries qij of matrix Q are substitution rates of amino acid residues for the set A of 
20 amino acid residues at an infinitesimally small time interval. Specifically, we have: Q = {qij}, 
where the diag onal element is = — V! ,- q,- -i. The transition probability matrix of size 20 x 20 
after time t is ( Lio and Goldmam Il998l l: 



P(t) = {py(t)} = P(0)exp(g-t), 
where -P(O) = /. Here Pij{t) represents the probability that a residue of type i will mutate into a 
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residue of type j after time t. To ensure that the n onsymmetric rate matr i x Q is diagonalizable for 
easy computation of P{t), we follow the reference ( Whelan and Goldmanl . 2001 ) and insist that Q 
takes the form of Q = SD, where D isa diagonal matrix who entries are the composition of residues 
from the region of interest on the protein structure, and S is a symmetric matrix whose entries 
need to be estimated. Because symmetric S is diagonalizable as S = VAV'^, the matrix Q = 
S D = D^/'^VAV^D^^^^ is also diagonalizable, hence P{t) = P{0){D^^^V) e^p{At){V'^ D-'^/'^). 



Likelihood function of a fixed phylogeny. 

For node k and node I separated by divergence time tki, the time reversible probability of observing 
residue Xk in a position h at node k and residue xi of the same position at node / is: 

T^XkPxkXiitkl) = TTxiPxiXkitkl). 

For a set 5 of s multiple-aligned sequences {xi,X2, ■ ■ ■ ,Xs) of length n amino acid residues in a 
specific region, we assume that a reasonably accurate phylogenetic tree T = (V, £) of the proteins 
is given. Here V is the set of nodes, namely, the union of the set of observed s sequences C (leaf 
nodes), and the set of s — 1 ancestral sequences I (internal nodes). £ is the set of edges of the 
tree. Let the vector Xh = {xi, ■ ■ ■ ,Xs)'^ be the observed residues at position h for the s sequences, 
h ranges from 1 to n. Without loss of generality, we assume that the root of the phylogenetic tree 
is an internal node k. Given the specified topology of the phylogenetic tree T and the set of edges, 
the probability of observing s number of residues at position h is: 

p{Xh\T,Q) =TT^^Y^ p^^^.(tij). 

after summing over the set A of all possible residue types for the internal nodes I. The probability 
P{S\T, Q) of observing all residues in the functional region is: 

n 

P{S\T, Q) = P{xi, ,Xs\T,Q) = ll p{xh\T, Q). 

h=l 

This can be used to calculate the log-likelihood function £ = logP(5|T, Q). 



Bayesian estimation of instantaneous rates. 

Our goal is to estimate the values of the Q matrix. The continuous time Markov model for 
residue^ substitutions ha s been im plemented in several studies using maximum likelihood estima- 
tion (iYanei. il994a,: Whelan and G oldman. 2001), and has also been applied in a protein folding 
study (jTseng and Lian el. l2004l V Different from these prior studies, here we adopt a Bayesian ap- 



proach. We use a prior distribution 7r{Q) to encode our past knowledge of amino acid substitution 
rates for proteins. We describe the instantaneous substitution rate Q = {qij} by a posterior dis- 
tribution Tr{Q\S,T), which summarizes prior information available on the rates Q = {qij} and the 
information contained in the observations S and T. After integrating the prior information and 
the likelihood function, the posterior distribution 7r(Q|5,T) can be estimated up to a constant as: 



7r(Q|5,T)oc j P{S\T,Q)-Tr{Q)dQ. 
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Our goal is to estimate the posterior means of rates in Q as summarizing indice: 

E^(Q) = j Q ■Tr{Q\S,T)dQ. 
In this study, we use uniform uninformative priors. Others choices are also possible. 

Markov chain Monte Carlo. 

We run a Markov chain to generate samples drawn from the target distribution 7r(Q|5, T). Starting 
from a rate matrix at time t, we generate a new rate matrix Qt^i using the proposal function: 
T{Qt, Qt+i)- The proposed new matrix Q^+i will be either accepted or rejected, depending on the 
outcome of an acceptance rule ?"(Qt;Qt+i)- Equivalently, we have: 

To ensure that the Markov chain will reach stationary state, we need to satisfy the requirement of 
detailed balance, i.e., 

7r(g,|5,T) ■ A(Q„Q,+i) = ^(Q,+i|5,T) • ^(Q,+i,Q,). 

This is achieved by using the Metropolis-Hastings acceptance ratio r(Qj,Qt+i) to either accept or 
reject Qt+i, depending on whether the following inequality holds: 

«<r(Q„Q,+,)-mm|l, ,(^^,5^ • T(Q„ Q,^,) ^' 

where u is a random number drawn from the uniform distribution Z//[ 0,1]. With the assumption tha t 
the underlying Markov process is ergodic, irreducible, and aperiodic (jCrimmett and Stizakeil. 2001 ), 



a Ma rkov chain generated following these rules will reach the stationary state ( Robert and Casellal . 
l2004h . 

We collect m correlated samples of the Q matrix after the Markov chain has reached its sta- 
tionary state. The posterior means of the rate matrix are then estimated as: 



E,((5)«^Q,.7r(Q,|5,T) 



i=l 

Move set. 

A move set determines the proposal function T{Q^, Qi_^i), which is critical for the rapid convergency 
of a Markov chain. To improve mixing, we design two type of moves for proposing a new rate matrix 
Qtj^i from a previous matrix Q^. When the state variable s for these two types of moves takes the 
value s = 1, we take Type 1 move. When the state s = 2, we take Type 2 move. For Type 1 moves, 
a single entry of the rate matrix with index ij is randomly chosen, and with equal probability we 
assign: 

qij,t+i = aiqij,t or qij,t+i = a2qij,t, 

where ai = 0.9, and 02 = 1.1. For Type 2 moves, we use a simplified residue alphabet of 
size 5 to represent the 2 amino acid residue types, based on the analysis described in reference 
( Li. Hu. and Liand . l200.'j ). The five residue types are: {G, A, V, L, I, P}, {F, Y, W}, {S, T, C, M, N, Q}, 



6 



{D, E}, and {K, R, H}. We select one of the 5 reduced residue types following U[l, 2, • • • ,5], and 
scale with equal probability all entries in Q corresponding to the residues contained in one of the 
simplified residue type, with a constant of either ai = 0.9 or 02 = 1-1 at equal probability. The 
transition between these two types of moves is determined by the transition matrix: 

siA ^ (0.9 0.1\ 
Wi 82,2) \0.9 O.lJ ■ 

Overall, the acceptance ratio of Type 1 moves is 50% — 66%, and the acceptance ratio of Type 2 
move is < 10%. 



Rate matrix Q and residue similarity score. 



To derive residue similarity scoring matrices for sequence alignm ents and database searche s from 
the evolutionary model, we calculate the residue similarity scores ( Karlin and Altschull . 1990l ) bij{t) 
between residues i and j at different evolutionary time t from the rate matrix Q: 



bijit) = - log 



A 



log 



rriijit) 



where mij{t) is joint probability of observing both residue type i and j at the two nodes separated 

1, because of the 



by time t, and A is a scalar. Here bij{t) satisfies the equality ^vTjvrje 
property of the joint probability J^.^jmad) = TTiPiift) = y^ „- tTj = 1 holds for Markov matrix 
which has the property ^ ■ Pij{t) (jOrimmett and S tizakeA 2001 ). The overall ex pected score of this 



matrix is then "^^j 'rnij{t)bij{t), usually in bit units (,Karlin and Altschull . 199d ). 



Computation of surface pockets and interior voids. 

We use the VOLBL method to compute the solvent accessible surface a rea of protein structures 

tedelsbrunner et al]ll995l: Liang et al.ll998al ') . We u se the CastP method ( Liang. Edelsbrunner. and Woodwardl . 



1998bl : IRinkowski. Naghibzadeh. and Liand . l2003b l^ to identify residues located on surface pock- 



ets. Both VOLBL and CastP are based on precomputed alpha shapes ( Edelsbrunner and Miickj . 
I1994I where the dual simplicial complex is constructed from the Delaunav triangulation of the 
atom i c coordinates of the protein. We use the pock et algorithm ( Edelsbrunner. Facello. and Lian'3 . 
Il998l : iLiang. Edelsbninner. and Woodwardl . 1998bl ) in CastP to identify residues located in sur- 
face pockets and interior voids. Details and other applications of these methods can be found in 



dLiang. Edelsbrunner. and Woodwar4ll998b|:lEdelsbrunner. Facello. and Liangl.ll998l:lLiang and Dill 



2OOII : iBinkowski. Adamian. and LiangL 12003 



Results 

There is a large number of parameters (189) characterizing the substitutions of amino acid residues. 
We first need to understand at what accuracy these parameters can be estimated. Because we are 
studying regions {e.g., binding surfaces) on a protein structure, we often only have a few dozen 
instead of a few hundred residue positions available for parameter estimation. In addition, we are 
frequently limited by the available sequence data, and the size of the phylogenetic tree may be 
moderate. Even if the parameters of the substitution model can be estimated, it is not clear how 
effective they are for applications such as inferring protein functions from protein structures. We 
describe our results addressing each of these issues. 
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Rate estimation: simulation studies. 

We first carry out a simulation study to test the accuracy of the estimated residue substitution rates. 
We generate a set of artificial sequences based on an evolutionary model with known substitution 
rates. We ask whether our method can recover the original substitution rates reasonably well, 
and how many sequences and residues are necessary so an accurate estimation can be made. For 
this purpose, we first take the sequence of the alpha-catalytic subunit of cAMP-dependent protein 
kinase (SwissProt P36887, pdb Icdk, with length 343), and the sequence of carboxypeptidase A2 
precursor (SwissProt P48052, pdb laye, length 417). 



Statistic s for estimation accuracy. W e use the Jones- Taylor-Thornton (Jtt) evolution- 



acv; V 

ary model (|,Tones. Tavlor. and ThorntoiJ . ll992l ). which is characterized by a frequency- independent 



amino acid interconversion rate matrix <S'j'p^ and the diagonal matrix D of the composition of 
the 20 amino acid residues for the set of sequences that were used to derive the original Jtt model 
(|Yana . 1997.}. The substitution rate matrix Qjtt is then: Qjtt — Sjtt ■ D. To avoid potential 
bias, we use the composition D of the protein kinase and the frequency-independent amino acid 
interconversion rate matrix of Sjtt to calculate the instantaneous rate matrix Q for the protein 
kinase, which is then used to generate 16 artificial kinase sequences at different time intervals t using 
the probability P{t) = exp(Qt)I. Here we use a simple balanced phylogenetic tree of 16 leaf nodes 
with equal branch lengths of t = 0.1 for all edges. We compare the estimated frequency-independent 
amino acid interconversion rate matrix S to the true matrix Sjtt- 

For comparison, we first normalized the estimated and true Jtt frequency-independent inter- 
conversion rate matrices, such that: 

-t: / Sii = 1 and — > sa = 1, 

where Sij is the (z,j)-th entry of the matrix S. 

We are interested in the rates of substitution that occur in a specific spatial region of the protein. 
Because these regions contain only a subset of the residues and often are under different selection 
pressure, not all possible substitutions are observed with adequate frequency for estimation. In 
addition, the usually moderate size of the phylogenetic tree limits the observed frequency of some 
substitutions. Nevertheless, the frequently observed substitutions for a specific protein region are 
likely to be the most important ones, and the estimation of their rates should be better than rates 
of infrequently observed substitutions. 

We need to quantitatively assess our estimation error. Because it is very difficult to estimate 
accurately the absolute values of the individual rates, we assess instead the errors in estimated Sij 
in terms of their effects on the overall patterns of residue substitution on a specific protein region. 
This is more appropriate for many applications such as the analysis of the evolution of binding 
surfaces and the evolution of the folding core, as only a subset of substitutions occur at a functional 
surface or in the core. We develop some quantitative measures for this purpose. 

We call a residue pair an occurring pair if both residues i and j occur simultaneously in 
one column of the multiple aligned sequences of a specific region. For the subset of rates S = {sij} 
for a residue pair from the set of occurring pairs V, we obtain the relative contribution of a 
specific frequency-independent interconversion rate between a pair of residues as: 

■^ij ~ '^ij/ ^ij- 
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The Acij weighted error in contribution is computed as: 



Ae,; 



where Sij is the estimated value of s^ -, fij is the number count of how often the (i, j) substitutions 



occur. 



To measure the overah differences of the estimated S and the original 'S' j^'j matrices for the 
occurring substitutions, we use the weighted mean square error (MSE-p) ()Mavrose et al.l . l2nn4h : 

mset. ^ Yl 

ijev 



Error analysis in estimated rates. Using the 16 artificial sequences generated from the 
sequence of carboxypeptidase and a simple balanced phylogenetic tree with equal branch length 
t = 0.1 for all edges between nodes, the Markov chain converges quickly after 3 x 10^ Monte Carlo 
steps (Figure^), as shown by the value of —£ for the negative likelihood function. After a burning- 
in period of 3 x 10^ steps, we collect m = 4 x 10^ samples for estimating {sij} values. Figure^ 
shows the estimation results for two simulations started from two different sets of initial values 
of {sij}. It is clear that both sets of estimated rates {sij} for the occurring pairs are in general 
agreement to the set of true values from the Jtt model. 

To further assess how robust the estimations are, we repeated the Markov chain Monte Carlo 
simulation 50 times using random initial values of {sij} drawn from a uniform distribution of 
Z//(0, 1). On average, the estimation error is small. The mean of the overall weighted MSE-p from 
50 simulations is 5.2 x 10~^ for occurring pairs (Figure^). 

Length dependency of errors in estimated parameters. To estimate region specific 
substitution rates, it is important to assess how the accuracy of the estimation depends on the 
size of the region. For example, the functional region of a protein contains only a small number of 
residues, which varies depending on the size of the binding site. We carry out another simulation 
study for this purpose. Starting from the N-termini of the 16 artificially generated carboxypeptidase 
sequences, we take a subsequence from each sequence, with the length increasing from 10 to 417, 
at an increment of 10 residues. We then estimate the substitution rates at each length. Each 
simulation of a different length was started from a random set of initial values drawn from U{0, 1), 
and the same burning-in period and sample size m are used as before. The MSE-p values obtained 
using sequences of different lengths are plotted in Figure ^i- Our results show that for this set of 
sequences, as long as the number of residues is > 20, the MSEp of the estimated parameters will 
be less than 0.008. 

Based on analysis of the protein structures in the Protein Data Bank, we found that among 
the surface pockets from 6,273 protein structures that all contain annotated functional residues (as 
recorded either in the Feature table of the SwissProt database or the Active Site field of the 
Pdb file), the average size of a functional site pocket is 35 residues, and the median is 23 residues 
(Figure 12^). This suggests that our method will be applicable for the analysis of protein functional 
pockets. 

We carried out another simulation study estimating substitution rates only for the binding 
surface of a protein. Using the same phylogenetic tree as that of the carboxypeptidase simulations 
and the same Jtt model, we generate 16 artificial sequences of the alpha-catalytic subunit of cAMP- 
dependent protein kinase (SwissProt P36887, pdb Icdk, length 343). Our goal is to estimate rates 
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Figure 1: Estimating residue substitution rates using simulated carboxypeptidase sequences, 
(a) The Markov chain converges after 3 x 10^. The insert shows negative log likelihood {—i) 
values in stationary state after the burning-in period; (b) Sij values estimated in two simulations 
are all similar to the true rates. In the first simulation, the 189 initial values are set such 
that Sij = 0.1 for all entries. In the second simulation, the 189 initial Sij values are sorted 
numerically by index i then by index j, and the values are assigned from 0.1 with an increment 
of 0.01 for the next entry, (c) The MSE-p values from 50 repeated estimations of substitution 
rates of carboxypeptidase with random initial values are all less than 8 x 10~'^. The mean 
value of MSE-p is 5.2 x 10~^. (d) The value of MSE-p depends on the length of available 
subsequences. For subsequence of length > 20, the MSEp value is < 0.008. 



only for the subset of 38 residues located in the binding site. Figure OK shows that the MSE-p values 
of the estimated rates from 110 independent simulations for the 90 occurring pairs of residues are 
all small. The estimated rates from all simulations have MSE-p < 8 x 10~^, and the mean of the 
overall MSEp from 110 simulations is 4.8 x 10^'^ for the 90 occurring pairs. Clearly, the estimation 
errors measured in MSE-p are larger when only residues in the binding site are used compared 
to the estimation errors of carboxypeptidase where all 417 residues are used. Nevertheless, the 
estimations are still useful, as the mean MSEp value remains small. Figure plots the individual 
mean value of weighted errors Acjj for the 90 occurring pairs obtained from 110 simulations. There 
are only 4 substitutions whose weighted error in contribution Acjj is greater than 3% , although 
all occurring pairs have Ae^j < 4.5%. 
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Figure 2: The length distribution and amino acid composition of functional pockets, (a) The 
length distribution of 6,273 functional pockets. The average length of functional pockets is 35 
residues, and the median is 23 residues, (b) Comparison of amino acid compositions of residues 
in 6, 273 functional pockets with the composition of 16, 300 protein sequences used to derive the 
Jtt substitution matrix. The dashed line is the expected probability of 0.05 if all substitution 
rates following the uniform distribution. 
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Figure 3: Estimating the substitution rates of residues on the binding surface of cAMP- 
dependent protein kinase from simulated sequences, (a) For 110 independent estimations of the 
substitution rates with random initial values, the MSE-p values are all < 8 x 10^^. The mean 
MSE-p value of the 110 estimations is 0.0048. (b) There are only 4 substitutions (empty circles) 
whose error Ae^ is great than 3.0%, although all of the 90 occurring pairs have Aejj < 4.5%. 



Evolutionary rates are region specific. 

Exposed surface and buried interior have different substitution rates. Residues on 
protein surfaces that are exposed to solvent are under different physicochemical constraints from 
residues in the buried interior. We estimate the substitution rates for exposed and buried regions on 

a protein structure. We use a simple criterion to classify residues as ei ther exposed or buried: Based 

on the calculation of solvent accessible (SA) surface area using VOLBL (jLiang. Edelsbrunner. Fu. Sudhakar. and Sul 
1998al ). we declare a residue to be buried if its SA area is A^, and exposed if SA area > A^. 



For the protein 2-haloacid dehalogenase (pdb lqh9), Figure 0] shows that the residues on the 
exposed surfaces and in the buried interior have very different substitution patterns. For example. 
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Figure 4: Substitution rates of residues on solvent exposed surface and in buried interior, (a) 
Substitution rates of buried interior residue on 2-haloacid dehalogenase (pdb lqh9). There are 
100 occurring pairs, (b) substitution rates of surface exposed residues of lqh9. There are 188 
occurring pairs, (c) Substitution rates of buried interior residue of alpha amylase (pdb Ibag). 
There are 190 occurring pairs, (d) substitution rates of surface exposed residues of Ibag. There 
are 177 occurring pairs. 



the substitution of Threonine (T) with Asparagine (N), Aspartate (D), or Glutamine(Q) occurs 
much more frequently in the buried interior than on the surface (Figure [IJi and Figure A 
similar pattern is also seen for alpha amylase (pdb Ibag, Figure lit and Figure^). In general, 
ionizable and polar residues in the protein interior have higher propensities to mutate to other 
ionizable and polar residues. 

The frequent substitutions between T and {N, D, Q} observed in the protein interior of 1-2- 
haloacid dehalogenase and amylase suggest that to maintain the H-bonding interactions in the 
protein interior, it is far more common to have substitutions among ionizable residues and polar 
residues. These substitution patterns point to the importance of preserving polar interactions, 
which provide important structural stability in the protein interior, as the high dielectric constants 
inside proteins makes the electrostatic contribution of salt-bridges and H-bonds in the protein 
interior stronger than H-bonds on protein surfaces. 

The conclusion that residues in the protein interior experience different selection pressure from 
residues on the protein surfaces are likely to be true for other proteins. We estimated the substitu- 
tion rates of buried residues and exposed residues for 6 additional proteins with different biological 
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Table 1 

Substitutions rate of residues in the interior and on the exposed surface 
are different. 



Protein Family pdb Interior Surface p— value of K-S test 

occurring pairs occurring pairs 



EC 3.4.11.18 


lb6a 


80 


175 


0.016 


EC 3.2.1.1 


Ibag 


190 


177 


0.015 


EC 2.3.3.1 


Icsc 


55 


163 


0.009 


EC 3.8.1.2 


lqh9 


139 


169 


0.023 


EC 3.2.1.21 


lh49 


60 


169 


0.024 


EC 3.5.1.5 


ludp 


92 


162 


0.014 


EC 1.1.1.37 


IbSv 


97 


150 


4.8 X 10 



functions as indicated by different enzyme classification numbers (Table^). In all cases, we find that 
surface residues have different evolutionary patterns overall. Although not all substitution rates 
are noticeably different, Table ^ shows that for each of the 8 proteins studied, we can reject the 
null hypothesis, based on the nonparametric Kolmogorov-Smirnov test, that the two distributions 
of substitution rates for the set of exposed residues and the set of buried residues are the same. 

Residues in functional sites and on the rest of the surface have different sub- 
stitution rates. Protein functional sites are the regions where a protein interacts with ligand, 
substrate, or other molecules. Because proteins fold into their three-dimensional native struc- 
tures, functional sites often involve residues that are distant in sequence but are in spatial prox- 
imity. As can be seen in Figure El two proteins with a low sequence identity (< 16%) may be 
very different overall, but their functional binding pockets may be quite similar. In this study, 
we use the CastP database of precomputed surface pockets for our analysis of functional sites 
on protein structures. This approach has been applied in studies of protein functi on prediction 
( Binkowski. Adamian. and Liang| . 2f)n3al : Binkowski. Naghibzadeh. and Liand . 2nn3bl ) and in struc- 



tural analysis of non-synonymous SNPs istitziel et al.l . bnn.ll 'l 



Residues that are located in functional pockets are under different selection pressures. This 
can be clearly seen in Figure [Jb, such that the composition of residues in functional pockets is 
very different from the composition of residues in the set of full protein sequences from which the 
Jtt substitution matrix was derived. Here we examine only protein surface pockets that contains 
functionally important residues as annotated by either SwissProt or Pdb. In functional pockets, 
Tyr, Trp, His, Asp and Gly residues are far more enriched, but Leu, Ser, and Ala are less if compared 
to sequences used in the Jtt rate matrix analysis. Tyr, Trp, His and Asp are residues that play 
important roles in enzyme reactions through electrostatic interactions, change of protonation states, 
and aromatic interactions. Gly residues are important in the formation of turns and other geometric 
features for binding site formation. The enrichment of hydrophobic Leu and small residues Ser and 
Ala in the full sequence are probably important for structural stability. 

We examine the patterns of residue substitutions on protein functional surfaces in some detail. 
Taking a structure of alpha amylase (pdb Ibag) as an example, we compare the estimated substi- 
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(e) 

LGTGSFGRVAKLKVLQHTELVMMEYV - - - EDKENLTDF 
49 50 51 52 53 54 55 56 57 70 72 74 78 79 82 84 87 88 91 95 104 118 120 121 122 123 127 166 168 170 171 173 183 184 185 

LGQGCFGEVA-IKLMFAMVLVITEYMGSLDDRANLADF 
273 274 275 276 277 278 279 280 281 293 294 295 297 302 307 311 314323 325 328 336 338 339 340 341 344 345 347 348 386 388 390 391 393 403 404 405 

Figure 5: Protein functional pockets of kinases. Functional site of (a) the catalytic subunit of 
cAMP-dependent protein kinase (icdk chain A), and (b) tyrosine protein kinase c-src (2src). 
Both kinases bind to AMP or AMP analogs. Their global primary sequence identity is as low 
as 16%. However, if we extract their binding surfaces (as shown in (c) and (d)) out, (e) the 
residues forming the binding pockets have much a higher sequence identity (51%). 

tution rate matrix of functional surface residues with that of the remaining surface residues of the 
protein (Figure EJ- It is clear that the selection pressures for residues located in functional site and 
for residues on the rest of the protein surface are different, and they are also both different from the 
Jtt matrix (data not shown). This suggests that identifying functionally related protein surfaces 
will be more effective if we employ scoring matrices specifically derived from residues located on 
functional surface instead of using a general precomputed substitution matrix. 
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Figure 6: Substitution rates of residues in the functional binding surface and the remaining 
surface of alpha-amylase (pdb Ibag). (a) Substitution rates of the functional binding surface. 
There are 39 occurring pairs, (b) substitution rates of the remaining surface on Ibag. There 
are 177 occurring pairs. 



Application: Detecting functionally similar biochemical binding 
surfaces. 

For proteins carrying out similar functions such as binding similar substrates and catalyzing similar 
chemical reactions, the binding surfaces experience similar physical and chemical constraints. The 
sets of allowed and forbidden substitutions will therefore be similar because of these constraints. 
The continuous time Markov model can provide evolutionary information at different time intervals 
once the instantaneous substitution rates are estimated. This information is encoded in the time- 
dependent residue substitution probabilities. An objective test of the utility of the estimated 
evolutionary model is to examine if we can discover functionally related proteins, namely, whether 
we can identify protein structures that have similar binding surfaces and carry out similar biological 
functions. 



Identification of functionally related proteins from a template binding surface. 

We use alpha-amylases as our test system. Alpha-amylase (Enzyme Classification number E.G. 3. 2. 1.1) 
acts on starch, glycogen and related polysaccharides and oligosaccharides. Detecting functionally 
related alpha amylase is a challenging task, as many of them have very low overall sequence identi- 
ties (< 25%) to the query protein template. If two proteins have a sequence ident ity be l ow 60 — 70%, 
it becomes difficult to make functional inferences based on sequence alignment ()R,ost . l2002h . 

Given a template binding surface from an alpha amylase (ibag, pdb), we wish to know how 
many protein structures can be identified that have the same enzyme classification (E.G.) number 
at an accuracy of all four E.G. digits. These protein structures all carry out the same or related 
reactions. By the convention of the Enzyme Glassification system, the E.G. numbers represent a 
progressively finer classification of the enzyme, with the first digit about the basic reaction, and 
the last digit often about the specific functional group that is cleaved during reactioii; 



We first exhaustively com pute all of the voids and pockets on this protein structure ((Liang. Edelsbrunner. and \^ 
1998bl : iBi nkowski. Naghibza deh. and Lianel . 2003bl ) . Based on biological annotation contained in 



the Protein Data Bank, the 60th pocket containing 18 residues is identified as the functional site 
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Figure 7: Function prediction of alpha amylases, (a) The phylogenetic tree for Pdb structure 
Ibag from B. subtilis. (b) The functional binding pocket of alpha amylase on Ibag. (c) 
A matched binding surface on a different protein structure (lb2y from human, full sequence 
identity 22%) obtained by querying with the binding surface of Ibag. (d) The phylogenetic tree 
for lbg9 from H. vulgare. (e) The binding pocket on lbg9. (f) A matched binding surface on a 
different protein structure (lu2y from human, full sequence identity 23%) obtained by querying 
with lbg9. 



(Figure I7|o) . To construct an evoluti onary model, we use sequence alignment tools to gather se- 
quences homologous to that of Ibag (jAltschul et After removing redundant sequences 
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and sequences with > 90% identity to any other identified sequences or the query sequence of Ibag, 
we obtain a set of 14 sequences of amylases. These 14 sequences are used to construct a phyloge- 
netic tree of alpha-amylase (Figure EK) • W e use the maximum-lik e liho od method implemented in 
the MOLPHY package for tree construction ( Adachi and Haseeawal . 19961 . 

We then calculate the similarity scoring matrices from the estimated values of the rate matrix. 
Because a priori we do not know how far a particular candidate protein is separated in evolution 
from the query template protein, we calculate a series of 300 scoring matrices, each characteriz- 
ing the residue substitution pattern at a different time separation, ranging from 1 time unit to 
300 time unit. Here 1 time unit repr esents the time required for 1 substitution per 100 residues 
( Davhoff. Schwartz, and Orcutt . 1978h. We use the Smith- Waterman algorithm as implemented in 
the SSEARCH method of Fasta (jPearsoiil . 199ll l with each of the 300 scoring matrices in turn to 
align sequence patterns of candidate binding s urfaces from a database of >2 m il lion p rotein surface 
pockets contained in the pvSoar database ( Binkowski. Freeman, and Lianj . |200j). We use an 
-E-value of 10~^ as the threshold to decide if a matched surface pocket is a hit. Surfaces similar 
to the query binding pocket identified (with values < 10~^) are then subjected to further shape 
analysis, where those that cannot be superimposed to the residues of the query surface pattern 
at a statistically signi ficant level (p-value < 0.01) by e i ther th e coordinate RMSD measure or the 
orientational RMSD ( Binkowski. Adamian. and Lianel. 2003al)measure are exclud ed. The p- value 
is estimated using methods developed in ({Binkowski. Adamian. and Liangl . 2003al ) . 

A total of 58 PDB structures are found to have similar binding surfaces to that of Ibag, and hence 
are predicted as amylases. All of them turn out to have the same E.G. number of 3.2.1.1 as that 
of Ibag. We repeat this study but using a different amylase structure as the query protein. Using 
the functional pocket on lbg9, we found 48 PDB structures with E.G. 3.2.1.1 labels. The union of 
the results from these two searches gives 69 PDB structures with E.G. 3. 2. 1.1 labels. Examples of 
matched protein surfaces are shown in Figure [3 



Comparison with others. We compare our results with other studies. The Enzyme Structure 
Database (Esd) (www . ebi . ac . uk/thornt on-srv) collects protein structures for enzymes contained 
in the Enzyme databank ( Bairochl . 19931 ) for study. Here we take the ESD database as the gold- 
standard, and all true answers are contained in this human curated database. There are 75 PDB 
entries with enzyme class label E.G. 3. 2. 1.1 in Esd (version Oct 2004). Out of the 75 structures, 
our method discovered 69 PDB structures (no redundancy) using Ibag and lbg9 as queries. 

We also compare our results with those obtained from a database search using sequence align- 
ment methods. Using the Smith- Waterman algorithm as implemented in Ssearch of the Fasta 
package with th e default BlosumSO matrix, only 32 st ructures are identified as alpha amylase 
(see Table 2 in ( Binkowski. Adamian. and LiangL 2003al )). When using Psi-blast and the NR 
database with default parameters, an value threshold of 10"'^, and < 10 iterations to generate 
position-specific weight matrices, 65 structures (no redundancy) among the 75 known structures of 
alpha-amylase are found after combining results from queries with Ibag and lbg9. 

We next tested search results using the standard Jtt matrix instead of the estimated protein- 
specific and surface-specific matrix. In this case, we find 52 hits instead of 58 using Ibag as the 
query protein, and 8 hits instead of 48 using lbg9 as the query protein. 

Our method differs from SSEARCH ( Pearsonl . 1998h in two aspects: first, we use short sequence 
patterns generated from the binding surface of the protein structure instead of the full protein 
sequences. Second, we use the customized scoring matrix derived from the estimated evolution- 
ary model instead of the standard Blosum matrix. Psi-blast differs from our method in that it 
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also uses full length primary sequences and it effectively uses an empirical model of position spe- 
cific weight matrices to extract evolutionary information from a set of multiple aligned sequences, 
without the benefit of using a phylogeny and an explicit parametric model. 

Compared to the Fasta sequence alignment and Psi-blast search, our method can identify 
more alpha amylases. In addition, because we directly detect binding surface similarity instead 
of global sequence similarity, our prediction has stronger implications for inferring functional re- 
lationships. In contrast, Psi-BLAST search does not provide information about which residues are 
important for function. We have also shown that our estimated rate matrix works much better than 
the generic precomputed Jtt matrix, especially when the query template surface has a relatively 
small size. 



Table 2 

Detecting functionally related proteins. 



Protein Family 


Query 


Pocket'' 


pocket 


Our" 


Results by 


Results by 


ESD-^ 




structure 


id 


length 


result 


Psi-blast'' 




(true answers) 


EC 3.2.1.1 


Ibag 


60 


18 


58 


45 


52 


75 


EC 3.2.1.1 


lbg9 


61 


12 


48 


21 


8 


75 


EC 3.8.1.2 


lqh9 


23 


16 


8 


8 


3 


8 


EC 3.5.4.4 


2ada 


49 


28 


23 


17 


19 


23 


EC 4.2.1.11 


lebh 


122 


35 


22 


20 


19 


22 


EC 1.13.11.39 


lkw9 


34 


23 


18 


16 


18 


18 



"Pocket id could be referenced through CastP database (cast.engr.uic.edu). 

"Our results are obtained from querying with a template binding surface and customize scoring matrices. 
'^The true answers are taken as those recorded in the human curated ESD database. 
"^Results using Psi-blast sequence alignment. 
"^Results using our method with a standard Jtt matrix. 



To examine whether our method works for proteins of other functions, we repeated our test 
using four additional enzymes of different biochemical functions. These are: 2,3-dihydroxybiphenyl 
dioxygenase (E.G. 1.13.11.39), adenosine deaminase (E.G. 3.5.4.4), 2-haloacid dehalogenase (E.G. 
3.8.1.2), and phosphopyruvate hydratase (E.G. 4.2.1.11). As shown in Table|21 we are able to find 
all other protein structures of the same E.G. numbers contained in the ESD in all four cases. Our 
results are better than using Psi-blast or using the Jtt matrix. 



Discussion 



We have developed a Bayesian method for estimating residue subs tituti on rates. B ayesian inference 
of phylo geny was independently introduced by Yang and R annala (Il997l'l Mau fit Qi (Il999l'l and Li et 
al (|2nnnl ) . Bayesian methods ha ve found wide applications dHuelsenbeck et al. , 2001 ■ l2002b ) , includ- 
ing host-parasite co-speciation dlluels enbeck. Rannal a. and Yand . Il997l ). estimation of divergence 
times of species (IThorne. Kishin9. and Pai nter. 1998.') , simultan eous sequence alignment and phy - 
logeny estimation (|Mitchisonl . ll999l ). inference of ancestral states ( Huelsenbeck and Bollbacj . 2001 ) , 
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and d etermination of the root position of a phylogenetic tree ( Huelsenbeck. Bollback. and Levinel . 
2002al ). Similar to others, our approach is based on the Markov chain Monte Carlo sampling tech- 



nique. Although we are not aware of any other studies using Bayesian models for the direct estima- 
tion of substitution rates between amino aci d residues, our approach is a natural extension of exist- 
ing w ork on maximum likelihood estimation ( Goldman and Yan j . 19941 : Yang. Nielsen, and Hasegawal . 
19981 ^ of codo n substitution rates for amino acid residues, and other studies based on Bayesian statis- 




Thorne. Kishino. and Painter 



tical analysis (^Yang and R,annala*1997*;*Huelsenbeck. E,annala ._ and Yane 
[1998 : Mau. Newton, and Larget 1999: Huel senbeck et al.. 2001 

In this work, we studied the substitution of residues using amino acid sequences instead of nu- 
cleotide sequences. In our model, the parameters of the continuous time Markov process are the 
rates of direct substitutions between residues. A more established model of residue substitution is 
that of the substitutions between codons. This model can provide rich information about detailed 
mechanisms of molecular evolution. For example, the differential effects of tr ansition vs. transver- 
sion and synonymous vs. nonsvnonymou s substitutions all can be modeled ( Goldman and Yangl . 

Our choice of the current model of direct residue substi- 



199 



Yang. Nielsen, and Hasegawi 



jn ymou s 
l ll998h . 



tution is based on two practical considerations. First, for the application of predicting protein func- 
tions, we find it is far easier to gather amino acid residue sequences than nucleotide sequences when 
large scale database searches are carried out. Second, when using scoring matrices derived from 
substitution rates to detect remotely related proteins, ar nino acid seque i ices give far better results 
in sensitivity and specificity than nucleotide sequences (|Pearsonl . Il998l : iLio and Goldmanl . Il999i ) . 
An interesting future study would be one that is based on codon substitution models, which will 
help to identify possible bias in the current approach, where the effects of transition/transversion 
and synonymous/nonsynonymous substitutions are not considered. 

It has long been rec ognized that the evolution ary divergence of protein structures is far slower 
than that of sequences (jChothia and Les5 . Il98fih . Since physical constraints on protein structure 
would give rise to associations be tween patterns of amino acid replacement and protein structure 
( Koshi and Goldsteinl . 199fil . Il997i ) , the substitution rates of residues in differe nt secondary struc- 
tural environments and of different solvent accessibility have been well-studied (iLesk and Chothial. 



1; iGoldman. Thorne. and JonesL Il996al. Il998al: iThorne. Goldman, and JonesLll99fil:lBustamante. Townsend. am 
I). In a pioneering work, Thorne et al. developed an evolutionary model that combines secondary 
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structure with residue replacement, and showed that the incor poration of secondary stru cture signif- 
icantly improves the evolutionary model for sucrose synthase (jThorne. Goldman, and .lones^ 119961 ) . 
The impact of secondary structure and solvent accessibility on protein evolution was further stud- 
ied in detail using a hidden Markov model in ( Goldman. Thorne. and Jonesl . 1998al ). Additional 
work showed that a n accurate evolution model ca n in tu rn lead t o accurate prediction of protein 
secondary structure (jGoldman. Thorne. and Jones 1996a : iLio et al..,.1998.) . Parisi and Echave have 
further developed a simulation model to study the effects of selection of structural perturbation on 
the site-dependent subs titution rates of residues ( Parisi and Echav^ 2f)01bl : Robinson et al. 200,4 
Parisi and Echav3 2005h . These studies highlighted the importance of physical constraints on pro- 



tein evolution. 

Our work is a continuation in the direction of assessing substitution rates of residues in different 
structural environments, but with an important novel development. Here we proposed to study 
substitution rates of residues in a new structural category, namely, residues from local binding 
surface regions that are directly implicated in biochemical functions. Since a fundamental goal 
of stud ying protein evolu t ion is to understand how biological functions emer 



appear dGu and GiJ 1200.4 IVogel et al.ll2004l : iLecomte. Vuletich. and LeskI 



200f 



^e, evolve, and dis- 
), estimation of the 



19 



substitution rates of residues on functional surfaces is critically important. 

Proteins are selected to fold to carry out necessary cellular roles. In many cases, they are 
involved in binding interactions with other molecules. Surface binding pockets and voids are 
therefore the most relevant structural region s, which can be computed using exact algorithms 
( Liang. Edelsbrunner. and Woodwardl . 1998bl ). A unique advantage of this novel structural cate- 
gory is that it allows better separation of residues experiencing selection pressure due to the con- 
straints of biochemical functions from those due to the constraints for physical structural integrity. 
In contrast, the structural categories of residues in different secondary structural environments and 
solvent accessibility are more suited to study how substitutions are related to protein stability, be- 
cause they inevitably will include many conservation patterns due to the requirements of structural 
stability. 

For example, solvent accessibility directly relates to the driving force of hvdrophobic effects for 
protein folding , and secondary structures are essential for m aintaining protein stability (lDil]lll99n| : 
Dill et aLlll995l'l. The structural categorizat ions developed in ( Goldman. Thorne. and Jonesl . 119967 



uni et B,iMi\)V)ri]. ine structural categorizat ions aeveiopea m ijiTOiaman. inorne. ana .ionesi . iit^Hns , 
1998al : iThorne. Goldman, and Jonea . Il996l ) are well-suited for studying how protein evolution is 
constrained by physical interactions important for protein folding and stability. For example, 
the patterns of hydrophobic residues in the buried interior, polar residues on the surface, and 
small residues in /3-turns are all due to structural constraints and do not have direct functional 
implications. Indeed, the study of Koshi and Goldstein found strong correlation between transfer 
free energy AG of amino acid residues, a physico-ch emical property of amino acid solvation energy, 
and residue substitution rates ( Koshi and Goldsteinl . 11996). The categorization of residues proposed 
here are designed for studying how protein evolution is constrained by function (i.e., protein- 
ligand/substrate binding and protein-protein interactions). To our best knowledge, this is the first 
study in which a structure-derived category amenable for computation is proposed that separates 
residues selected for function from residues selected for stability. 

Our results showed that residues located in functional pockets have different substitution rates 
from residues in the remaining parts of the protein. The differences are mostly due to residues such 
as His and Asp that are known to be important for protein function. All of these region-specific 
substitution rate matrices are different from the precomputed Blosum matrix. 

It is informative to examine the difference of the substitution rates in the Jtt matrix and the 
binding site specific rate matrices we estimated. The Jtt matrix was developed using a very large 
database of sequences, and the overall composition -Djrprp of amino acid residues is very different 
from the composition D of the binding surfaces. Hence, the conserved residues, or the values of the 
diagonal elements sa of the substitution matrix, are very different. This is reflected in the different 
residue composition for functional surfaces and the full protein sequence (Figure |^). This would 
result in different overall patterns of substitutions. For substitution after a long time interval, it is 
necessary to estimate the off-diagonal elements Sij with some accuracy, as the substitutions would 
accumulate with time, and identifying remotely related binding surfaces becomes difficult. 

It is challenging to estimate substitution rates of amino acid residues in a local region. The 
number of residue positions for a specific region may be small, and the available sequences in the 
phylogenetic tree may also be limited. It is unlikely that all 189 independent substitution rates of 
the 20 X 20 matrix can be estimated accurately when only limited data is available. In this study, 
we can only estimate substitution rates for occurring pairs, namely, substitutions between residues 
that occur in the same position in different sequences. However, for applications such as inferring 
protein functions by matching similar binding surfaces, our results show that the constructed scoring 
matrices are very effective. It is likely that the substitutions (or lack thereof) that occur in the 
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sampled data for a specific region are the most important ones in overall patterns of evolution of 
residues in this specific region. For example, the most important features in a functional pocket on 
a protein structure are the conserved residues. Accurate estimation of the diagonal rates (su) is 
therefore the most important task. Because conserved residues appear in relatively higher frequency, 
they often can be estimated well. If some substitutions never occur in the sampled data, they 
probably are not important and setting their values to a baseline offset value such as that from a 
uniform prior would be reasonable. We have carried out detailed studies on identifying functionally 
related alpha amylases and other enzymes by querying with one or more template binding surface 
and assessing similarity using scoring matrices derived from the estimated rates. As shown in 
Table El our approach works very well in practice. In a control study, we assign random values to 
the matrix entries, which conform to the normalization condition. Scoring matrices derived from 
this randomized rate matrix are ineffective, and we were not able to find any functionally related 
proteins for any example listed in Table [21 

One might wish to estimate a 20 x 20 substitution rate matrix that is specific to an individual 
site or position in the sequence. However, this would require a very large amount of data that are 
not available in practice. In addition, it is conceivable that estimating site specific rate matrices 
may not be necessary or possible. For example, if a residue is critical for protein folding stability, it 
might be conserved through all stages of the evolution, and there is no variation at this particular 
position of the amino acid sequences. In such cases, it is difficult to estimate a full substitution 
matrix for this site. In our approach, we essentially pool residues that are located in the same 
region together, and assume they experience similar evolutionary pressure. 

Ultimately, the effectiveness of incorporating structural information in phylogenetic analysis 
and evolutionary models can be tested on the criterion whether it in turn helps to understand 
the organization principles of protein structures and their biochemical functions. As indicated by 
successful applications in protein function prediction reported here, structure-based phylogenetic 
analysis provides a powerful framework for studying significant problems in structural biology. 

Our method benefits from existing computational techniques. Without the mathe matical theorv 
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1998bl ). strategies for shape similarity assessment (|Binkowski. Adamian. and Liang . 2003a ), as well 
as demonstrated success of th ese co mputational techniques jL iang. Edelsbru nner. and Wood wardl . 
1998bl : Li. Hu. and Lianj . 12003. : .Binkowski. Adamian. and Liang. .20D3_a : Li and Liana . .2005. ) , the 
novel category of functionally important surface pockets would not be possible. 

There are, however, some limitations in our method. If the number of homologous sequences is 
too few (< 10) or the length of the functionally important binding pocket is too short (< 8 residues), 
there will not be enough data for parameter estimation. Another limitation of our study is the 
assumption that all sites in a protein evolve according to the same rate matrix along all branches 
of the phylogenetic tree. Although simulation studies and applications indicate that the estimated 
rates are sufficiently accurate for the purpose of detecting functionally related protein surfaces, this 
assumption may not be realistic fo r studving detailed evolutionary history and mechanisms for a 
specific protein ( Yanj . Il993l . Il994bl : iHuelsenbeck and Nielsenl . Il999l : iFelsensteinl . [^01 ) . 

Our simulation study is simple and cannot provide a full picture of the estimation errors under 
different biological conditions. The focus of our simulation study is to assess how estimation error is 
affected by the length of a functional pocket. In our method, the proper and accurate construction 
of a high quality phylogenetic tree is essential. We find it important to carefully select amino 
acid sequences to ensure quality multiple sequence alignments, where few gaps are introduced and 
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proteins of different divergence are well represented. In our practice, we find that the maximum 
likelihood estimator of MOLPHY works well with amino acid sequences for constructing phylogenetic 
trees. The effects of the assumption that the input phylogenetic tree is optimal, as well as the effects 
of different input branch lengths on the accuracy of estimation, needs further detailed studies. Our 
preliminary results suggest that the estimated scoring matrices for protein functional sites and 
database search results are insensitive to small perturbations in the phylogenetic tree and the 
branch lengths. For instance, in a database search of alpha amylase, we are able to use different 
surface templates, each from a different protein structure with its own slightly different phylogenetic 
tree and branch lengths. Our results show that the sets of functionally related proteins are nearly 
identical (data not shown). 

Furthermore, the choice of a prior is an important and complex issue in Bayesian statistics. We 
assume that the likelihood function dominates and the information from the prior is limited. More 
detailed study is needed for a clear understanding of the influence of the choice of prior. 

In summary, we have extended existing continuous time Markov models of residue substitution 
from that of codon-codon replacement to a model of residue-residue replacement. We have also 
developed a novel structural category of local surface regions that is well-suited for studying the 
evolution of protein functions. We have implemented an effective Bayesian Monte Carlo method 
that can successfully estimate the substitution rates of residues in small local structural regions 
in proteins. In addition, we have developed a database search method using scoring matrices 
derived from estimated residue substitution rates. Our results in solving the fundamental problem 
of inferring protein functions from protein structures show very encouraging results. There are 
other novel technical developments. For example, we find it necessary to develop an efficient move 
set for rapid mixing in Monte Carlo estimation of substitution rates. We have also explored how 
reliability of estimated substitution rates depends on the size of the local region. As indicated by 
the successful applications reported here, we believe that phylogenetic analysis of protein evolution 
provides powerful tools for the important bioinformatic task of protein function prediction. 
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